TreeTerminus —creating transcript trees using inferential replicate counts

https://doi.org/10.1016/j.isci.2023.106961

Singh, Noor Pratap; Love, Michael I.; Patro, Rob (June 2023, iScience)

SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty

https://doi.org/10.1186/s13059-023-03003-x

Wu, Euphy Y.; Singh, Noor P.; Choi, Kwangbom; Zakeri, Mohsen; Vincent, Matthew; Churchill, Gary A.; Ackert-Bicknell, Cheryl L.; Patro, Rob; Love, Michael I. (July 2023, Genome Biology)

Abstract: Detecting allelic imbalance at the isoform level requires accounting for inferential uncertainty, caused by multi-mapping of RNA-seq reads. Our proposed method, SEESAW, uses Salmon and Swish to offer analysis at various levels of resolution, including gene, isoform, and aggregating isoforms to groups by transcription start site. The aggregation strategies strengthen the signal for transcripts with high uncertainty. The SEESAW suite of methods is shown to have higher power than other allelic imbalance methods when there is isoform-level allelic imbalance. We also introduce a new test for detecting imbalance that varies across a covariate, such as time.
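The aggregation idea in this abstract (summing isoforms that share a transcription start site before assessing imbalance) can be illustrated with a minimal sketch. This is not the SEESAW implementation: the function names, the isoform and TSS labels, and the counts are all hypothetical, and the real method works on Salmon output with inferential replicates rather than on single point estimates.

```python
from collections import defaultdict


def aggregate_by_tss(isoform_counts, tss_of):
    """Sum (allele1, allele2) counts over isoforms that share a TSS label.

    isoform_counts: dict mapping isoform id -> (allele1_count, allele2_count)
    tss_of:         dict mapping isoform id -> TSS group label (hypothetical)
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for isoform, (a1, a2) in isoform_counts.items():
        group = tss_of[isoform]
        totals[group][0] += a1
        totals[group][1] += a2
    return {group: tuple(pair) for group, pair in totals.items()}


def allelic_ratio(a1, a2):
    """Fraction of reads assigned to allele 1; None when there are no reads."""
    total = a1 + a2
    return a1 / total if total > 0 else None


# Hypothetical example: two isoforms share tss_A, one isoform uses tss_B.
counts = {"tx1": (30, 10), "tx2": (20, 20), "tx3": (5, 15)}
tss = {"tx1": "tss_A", "tx2": "tss_A", "tx3": "tss_B"}
by_tss = aggregate_by_tss(counts, tss)
ratios = {group: allelic_ratio(*pair) for group, pair in by_tss.items()}
```

Pooling counts within a TSS group trades isoform resolution for larger, more stable counts, which is the trade-off the abstract points to for transcripts with high uncertainty.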
Airpart: Interpretable statistical models for analyzing allelic imbalance in single-cell datasets

https://doi.org/10.1093/bioinformatics/btac212

Mu, Wancen; Sarkar, Hirak; Srivastava, Avi; Choi, Kwangbom; Patro, Rob; Love, Michael I (April 2022, Bioinformatics)
Kendziorski, Christina (Ed.)
Abstract
Motivation: Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial-, or time-dependent AI signals may be dampened or not detected.
Results: We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing (scRNA-seq) data, or other spatially- or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower RMSE of allelic ratio estimates than existing methods. In real data, airpart identified differential AI (DAI) patterns across cell states and could be used to define trends of AI signal over spatial or time axes.
Availability: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart.
Compression of quantification uncertainty for scRNA-seq counts

https://doi.org/10.1093/bioinformatics/btab001

Van Buren, Scott; Sarkar, Hirak; Srivastava, Avi; Rashid, Naim U; Patro, Rob; Love, Michael I (January 2021, Bioinformatics)
Birol, Inanc (Ed.)
Abstract
Motivation: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements.
Results: We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset.
Availability and implementation: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode.
Supplementary information: Supplementary data are available at Bioinformatics online.
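The compression step described above (keep only the mean and variance, then regenerate replicates from a negative binomial) can be sketched as follows. This is an illustration under stated assumptions, not the fishpond `makeInfReps` code: the function names are hypothetical, the negative binomial parameters are obtained by simple moment matching, draws use a gamma-Poisson mixture, and a Poisson fallback is assumed when the variance does not exceed the mean.

```python
import math
import random
from statistics import fmean, variance


def compress(replicates):
    """Reduce one gene's inferential replicate counts to (mean, variance)."""
    return fmean(replicates), variance(replicates)


def nb_params(mean, var):
    """Moment-matched negative binomial (size, prob).

    Returns None when var <= mean, i.e. when there is no overdispersion
    and a negative binomial cannot match the two moments.
    """
    if var <= mean:
        return None
    size = mean * mean / (var - mean)
    prob = mean / var
    return size, prob


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small-to-moderate lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def pseudo_replicates(mean, var, n, rng):
    """Draw n pseudo-inferential replicates matching the stored moments."""
    params = nb_params(mean, var)
    draws = []
    for _ in range(n):
        if params is None:
            lam = mean  # assumed fallback: plain Poisson
        else:
            size, prob = params
            # Negative binomial as a Poisson with a gamma-distributed rate.
            lam = rng.gammavariate(size, (1 - prob) / prob)
        draws.append(sample_poisson(lam, rng))
    return draws


# Hypothetical usage: compress three replicates, then regenerate five.
stored_mean, stored_var = compress([8, 10, 12])
regenerated = pseudo_replicates(10.0, 20.0, 5, random.Random(0))
```

Only two numbers per gene and cell are stored, which is where the disk and memory savings reported in the abstract come from; the regenerated draws reproduce the first two moments, not the original replicate values.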
Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data

https://doi.org/10.1093/bioinformatics/btaa448

Sarkar, Hirak; Srivastava, Avi; Bravo, Héctor Corrada; Love, Michael I; Patro, Rob (July 2020, Bioinformatics)

Abstract
Motivation: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects.
Results: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.
Availability and implementation: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus.
Supplementary information: Supplementary data are available at Bioinformatics online.
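The claim that a group's total output is far less uncertain than its members can be seen in a toy example. The sketch below is not Terminus's grouping criterion: the replicate values are invented, and the coefficient of variation across inferential replicates is used here only as an assumed, convenient summary of uncertainty.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """Spread of replicate estimates relative to their mean (0.0 if mean is 0)."""
    mean = fmean(values)
    return pstdev(values) / mean if mean > 0 else 0.0


def group_replicates(*transcripts):
    """Sum abundance estimates replicate by replicate across transcripts."""
    return [sum(draws) for draws in zip(*transcripts)]


# Hypothetical inferential replicates for two transcripts that share most
# of their fragments: the split between them is unstable, the total is not.
tx_a = [10.0, 30.0, 20.0, 40.0]
tx_b = [40.0, 20.0, 30.0, 10.0]
group = group_replicates(tx_a, tx_b)

cv_a = coefficient_of_variation(tx_a)
cv_b = coefficient_of_variation(tx_b)
cv_group = coefficient_of_variation(group)
```

In this constructed case the two transcripts vary strongly and in opposite directions, so each has a large coefficient of variation while their sum is constant across replicates.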
Alignment and mapping methodology influence transcript abundance estimation

https://doi.org/10.1186/s13059-020-02151-8

Srivastava, Avi; Malik, Laraib; Sarkar, Hirak; Zakeri, Mohsen; Almodaresi, Fatemeh; Soneson, Charlotte; Love, Michael I.; Kingsford, Carl; Patro, Rob (December 2020, Genome Biology)
Abstract
Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy.
Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.
Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.
Nonparametric expression analysis using inferential replicate counts

https://doi.org/10.1093/nar/gkz622

Zhu, Anqi; Srivastava, Avi; Ibrahim, Joseph G; Patro, Rob; Love, Michael I (August 2019, Nucleic Acids Research)

Abstract: A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases. Ideally, a statistical testing procedure should incorporate the inherent uncertainty of the abundance estimates arising from the quantification step. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts for each gene or transcript, and a subset of methods can incorporate uncertainty. Previous work has shown that nonparametric models for RNA-seq differential expression may have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account inferential uncertainty, leading to an inflated false discovery rate, in particular at the transcript level. We propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty. We compare our method, Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing differential expression between sub-populations of cells, and compare its performance to the Wilcoxon test.
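One way to picture a rank-based test that accounts for inferential uncertainty is to compute a rank statistic within each inferential replicate and average it across replicates. The sketch below assumes exactly that, using a Wilcoxon rank-sum with midranks for ties; the function names and data are hypothetical, the permutation step needed for significance is omitted, and this is not the published Swish implementation.

```python
from statistics import fmean


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum(counts, in_second_group):
    """Wilcoxon rank-sum statistic: sum of ranks of the second group's samples."""
    ranks = midranks(counts)
    return sum(r for r, flag in zip(ranks, in_second_group) if flag)


def mean_rank_sum(inferential_replicates, in_second_group):
    """Average the rank-sum statistic over inferential replicates.

    inferential_replicates: list of replicates, each a list of per-sample
    counts for a single gene or transcript.
    """
    return fmean(rank_sum(rep, in_second_group) for rep in inferential_replicates)


# Hypothetical feature: four samples (two per condition), two replicates.
condition = [False, False, True, True]
replicates = [[1, 2, 3, 4], [1, 3, 2, 4]]
statistic = mean_rank_sum(replicates, condition)
```

Averaging over replicates pulls the statistic toward its null value when the ranking of samples changes from one replicate to the next, which is the situation for features with high inferential uncertainty.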
Tximeta: Reference sequence checksums for provenance identification in RNA-seq

https://doi.org/10.1371/journal.pcbi.1007664

Love, Michael I.; Soneson, Charlotte; Hickey, Peter F.; Johnson, Lisa K.; Pierce, N. Tessa; Shepherd, Lori; Morgan, Martin; Patro, Rob; Pertea, Mihaela (February 2020, PLOS Computational Biology)

A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs

https://doi.org/10.26508/lsa.201800175

Soneson, Charlotte; Love, Michael I; Patro, Rob; Hussain, Shobbir; Malhotra, Dheeraj; Robinson, Mark D (January 2019, Life Science Alliance)

Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis

https://doi.org/10.1146/annurev-biodatasci-072018-021255

Van den Berge, Koen; Hembach, Katharina M.; Soneson, Charlotte; Tiberi, Simone; Clement, Lieven; Love, Michael I.; Patro, Rob; Robinson, Mark D. (July 2019, Annual Review of Biomedical Data Science)

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.